# Multimodal Transformer

**Jedi 7B 1080p GGUF** — lmstudio-community · Apache-2.0 · 113 downloads · 1 like
Tags: Image-Text-to-Text, English
An image-text-to-text generation model based on the Transformer architecture, designed specifically for computer and GUI scenarios, with intelligent-agent capabilities.

**My Model** — anoushhka · MIT · 87 downloads · 0 likes
Tags: Image-to-Text, PyTorch, Supports Multiple Languages
GIT is a Transformer-based image-to-text generation model capable of generating descriptive text from input images.

**Spaceexploreai Small Base Regression 27M** — NEOAI · Apache-2.0 · 57 downloads · 4 likes
Tags: Large Language Model, Supports Multiple Languages
A deep-learning-based investment prediction system using the Transformer architecture, integrating design elements from DeepSeek-V3 and Llama 3 for stock-price trend forecasting and technical analysis.

**Microsoft Git Base** — seckmaster · MIT · 18 downloads · 0 likes
Tags: Image-to-Text, Supports Multiple Languages
GIT is a Transformer-based generative image-to-text model capable of converting visual content into textual descriptions.

**Stable Diffusion 3.5 Large Turbo** — stabilityai · Other · 57.11k downloads · 581 likes
Tags: Text-to-Image, English
A text-to-image model based on the Multimodal Diffusion Transformer (MMDiT), using Adversarial Diffusion Distillation (ADD) to improve image quality, typography, and complex prompt understanding.

**Git Large Coco** — alexgk · MIT · 25 downloads · 0 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based image-to-text generation model capable of generating descriptive text from input images.

**Git Base Finetune** — wangjin2000 · MIT · 18 downloads · 0 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based generative image-to-text model capable of converting visual content into descriptive text.

**Textcaps Teste2** — artificialguybr · MIT · 26 downloads · 3 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based image-to-text generation model trained on large-scale image-text pairs, capable of tasks such as image captioning and visual question answering.

**Git Large R Textcaps** — microsoft · MIT · 51 downloads · 10 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder dual-conditioned on CLIP image tokens and text tokens, designed for tasks such as image caption generation and visual question answering.

**Git Large R Coco** — microsoft · MIT · 86 downloads · 10 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based generative image-to-text model capable of generating descriptive text from images.

**Git Large Vatex** — microsoft · MIT · 267 downloads · 1 like
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder conditioned on CLIP image tokens and text tokens, designed for tasks like image and video caption generation and visual question answering.

**Git Large Textvqa** — microsoft · MIT · 62 downloads · 4 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a vision-language model based on a Transformer decoder, trained with dual conditioning on CLIP image tokens and text tokens, optimized specifically for TextVQA tasks.

**Git Large Vqav2** — microsoft · MIT · 401 downloads · 17 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder based on CLIP image tokens and text tokens, trained on large-scale image-text pairs, suited to tasks like visual question answering.

**Git Large Textcaps** — microsoft · MIT · 1,749 downloads · 28 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a dual-conditioned Transformer decoder, designed for tasks such as image caption generation and visual question answering.

**Git Large Coco** — microsoft · MIT · 6,582 downloads · 103 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder-based vision-language model capable of generating image captions and performing visual question answering.

**Git Base Vatex** — microsoft · MIT · 752 downloads · 4 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based generative image-to-text model; this base version is fine-tuned on the VATEX dataset and suited to tasks such as image and video caption generation.

**Git Large** — microsoft · MIT · 1,404 downloads · 15 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder dual-conditioned on CLIP image tokens and text tokens for image-to-text generation tasks.

**Git Base Vqav2** — microsoft · MIT · 199 downloads · 19 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder-based vision-language model trained with conditioning on CLIP image tokens and text tokens, suited to tasks like image captioning and visual question answering.

**Git Base Textcaps** — microsoft · MIT · 482 downloads · 8 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer-based generative image-to-text model capable of converting visual content into descriptive text.

**Git Base Coco** — microsoft · MIT · 5,461 downloads · 19 likes
Tags: Image-to-Text, Transformers, Supports Multiple Languages
GIT is a Transformer decoder conditioned on CLIP image tokens and text tokens, used for tasks such as image caption generation and visual question answering.

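The GIT entries above all describe the same conditioning scheme: a single Transformer decoder sees CLIP image tokens followed by text tokens, where image tokens attend to each other bidirectionally and text tokens attend to all image tokens plus earlier text tokens causally. As an illustrative sketch (a toy numpy mask, not Microsoft's implementation), that attention pattern looks like this:

```python
import numpy as np

def git_style_attention_mask(num_image_tokens: int, num_text_tokens: int) -> np.ndarray:
    """Build a GIT-style attention mask (1 = may attend, 0 = masked).

    Image tokens attend bidirectionally among themselves; text tokens
    attend to every image token and causally to preceding text tokens.
    """
    n = num_image_tokens + num_text_tokens
    mask = np.zeros((n, n), dtype=int)
    # Image tokens: full bidirectional attention among themselves.
    mask[:num_image_tokens, :num_image_tokens] = 1
    # Text tokens: attend to all image tokens...
    mask[num_image_tokens:, :num_image_tokens] = 1
    # ...and causally to text tokens up to and including themselves.
    causal = np.tril(np.ones((num_text_tokens, num_text_tokens), dtype=int))
    mask[num_image_tokens:, num_image_tokens:] = causal
    return mask

print(git_style_attention_mask(2, 3))
```

The key property is the asymmetry: text rows have access to the whole image block, while image rows never see text, which is what lets one decoder serve both captioning and VQA-style generation.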
**Vision Perceiver Conv** — deepmind · Apache-2.0 · 7,127 downloads · 6 likes
Tags: Image Classification, Transformers
A general-purpose vision Perceiver model pre-trained on ImageNet, combining convolutional preprocessing with a Transformer architecture and supporting image classification tasks.

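The Perceiver's central idea is that a small learned latent array cross-attends to a much larger input array, so attention cost scales linearly in the input size rather than quadratically. A minimal numpy sketch of one such cross-attention step (a toy illustration, not DeepMind's implementation, which additionally uses the convolutional preprocessor mentioned above):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents: np.ndarray, inputs: np.ndarray) -> np.ndarray:
    """One Perceiver-style cross-attention step.

    A small latent array (num_latents x d) queries a large input array
    (num_inputs x d): cost is O(num_latents * num_inputs) instead of the
    O(num_inputs^2) of full self-attention over the inputs.
    """
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)   # (num_latents, num_inputs)
    return softmax(scores, axis=-1) @ inputs   # (num_latents, d)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))     # small learned bottleneck
pixels = rng.normal(size=(1024, 16))   # large array of image features
out = cross_attend(latents, pixels)
print(out.shape)  # (8, 16)
```

Because the output keeps the latent shape regardless of input length, the same body of Transformer layers can then run self-attention over just the 8 latents, which is what makes the architecture "general-purpose" across modalities.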
**S2T Small MuST-C En-Es ST** — facebook · MIT · 20 downloads · 0 likes
Tags: Speech Recognition, Transformers, Supports Multiple Languages
A speech-to-text Transformer model for end-to-end English-to-Spanish speech translation.

© 2025 AIbase